Unbiased Multivariate Correlation Analysis

Authors

  • Yisen Wang
  • Simone Romano
  • Vinh Nguyen
  • James Bailey
  • Xingjun Ma
  • Shu-Tao Xia
Abstract

Correlation measures are a key element of statistics and machine learning, and are essential for a wide range of data analysis tasks. Most existing correlation measures target pairwise relationships, but real-world data can also exhibit complex multivariate correlations involving three or more variables. We argue that multivariate correlation measures should be comparable, interpretable, scalable and unbiased. However, no existing measure satisfies all of these requirements. In this paper, we propose an unbiased multivariate correlation measure, called UMC, which satisfies all of the above criteria. UMC is a cumulative-entropy-based, non-parametric multivariate correlation measure that can capture both linear and non-linear correlations for groups of three or more variables. It employs a correction for chance, using a statistical model of independence, to address the issue of bias. UMC has high interpretability, and we empirically show that it outperforms state-of-the-art multivariate correlation measures in terms of statistical power, as well as in both subspace clustering and outlier detection tasks.

Introduction

Analysing correlations is a fundamental task in both statistics and machine learning. It has applications in many real-world learning tasks, e.g., feature selection (Brown et al., 2012), subspace search (Nguyen et al., 2013), causal inference (Bareinboim, Tian, and Pearl, 2014) and subspace clustering (Kriegel, Kröger, and Zimek, 2009). For continuous (as opposed to discrete) variables, most existing correlation measures focus on pairwise relationships. For example, Pearson's correlation coefficient detects bivariate linear correlations, and the Maximal Information Coefficient (MIC) (Reshef et al., 2011) detects both linear and non-linear bivariate correlations. However, real-world data often contains three or more variables, which can exhibit multivariate (higher-order) correlations.
If bivariate measures are used to identify multivariate correlations through pairwise aggregation, multivariate correlations can potentially be overlooked. For example, it has been shown that genes may reveal only a weak correlation with a disease when considered individually, while the correlation for a group of genes may be very strong (Zhang et al., 2008).

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

[Figure: Score (0.0 to 1.0) versus Dimension (2 to 10). Panel (a), "UDS and UDS-r", shows UDS and UDS-r for a functional relationship and for independent variables. The remaining panels are truncated in the source.]
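The point about pairwise aggregation can be made concrete with a small toy example. The sketch below is not UMC and is not taken from the paper; it uses hypothetical data (a symmetric grid of `x` and `y` values with `z = x * y`) and a plain Pearson coefficient to show that every pairwise correlation can be zero even though one variable is completely determined by the other two.

```python
import itertools
import math


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical toy data (an assumption for illustration only):
# every (x, y) pair on a symmetric grid, with z fully determined by z = x * y.
grid = [-1.0, -0.5, 0.5, 1.0]
pairs = list(itertools.product(grid, grid))
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]
z = [a * b for a, b in pairs]

# Pairwise view: each bivariate Pearson coefficient on its own.
pairwise = {
    "x,y": pearson(x, y),
    "x,z": pearson(x, z),
    "y,z": pearson(y, z),
}
max_pairwise = max(abs(r) for r in pairwise.values())

# Joint view: z against the product of x and y, which uses all three variables.
joint = pearson([a * b for a, b in zip(x, y)], z)

print("pairwise correlations:", pairwise)
print("largest absolute pairwise correlation:", max_pairwise)
print("correlation of z with x * y:", joint)
```

Any aggregate of the three pairwise scores (maximum, mean, sum) would report no relationship here, while the three variables together are functionally related. The joint check relies on knowing the functional form in advance, which is exactly what a non-parametric multivariate measure is meant to avoid.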

Similar resources

The distance correlation t-test of independence in high dimension

AMS subject classifications: primary 62G10; secondary 62H20. Keywords: dCor, dCov, multivariate independence, distance covariance, distance correlation, high dimension. Abstract: Distance correlation is extended to the problem of testing the independence of random vectors in high dimension. Distance correlation characterizes independence and determines a test of multivariate independence for rand...


Multivariate geostatistical estimation using minimum spatial cross-correlation factors (Case study: Cubuk Andesite quarry, Ankara, Turkey)

The quality properties of andesite (Unit Volume Weight, Uniaxial Compression Strength, Los500, etc.) are required to determine the exploitable blocks and their sequence of extraction. However, the number of samples that can be taken and analyzed is restricted, and thus the quality properties should be estimated at unknown locations. Cokriging has been traditionally used in the estimation of spa...


Full correlation matrix analysis of fMRI data

Functional brain imaging produces huge amounts of data, of which only a fraction are analyzed. Existing univariate and multivariate analyses of brain activity ignore interactions between regions, and analyses of interactions (functional connectivity) are typically biased toward regions of interest chosen based on their activity profile. This technical report provides a provisional description o...


An unbiased Cp criterion for multivariate ridge regression

Mallows’ Cp statistic is widely used for selecting multivariate linear regression models. It can be considered to be an estimator of a risk function based on an expected standardized mean square error of prediction. Fujikoshi and Satoh (1997) have proposed an unbiased Cp criterion (called modified Cp; MCp) for selecting multivariate linear regression models. In this paper, the unbiased Cp crite...


Full correlation matrix analysis (FCMA): An unbiased method for task-related functional connectivity.

Background: The analysis of brain imaging data often requires simplifying assumptions because exhaustive analyses are computationally intractable. Standard univariate and multivariate analyses of brain activity ignore interactions between regions, and analyses of interactions (functional connectivity) reduce the computational challenge by using seed regions of interest or brain parcellations. N...


Publication date: 2017